DOCUMENT RESUME

ED 211 935                                                  CS 006 442

AUTHOR          Fisher, Donald
TITLE           Assessment of Reading Competencies. Literacy: Meeting
                the Challenge.
INSTITUTION     Office of Education (DHEW), Washington, D.C. Right to
                Read Program.
PUB DATE        80
NOTE            33p.; Paper presented at the National Right to Read
                Conference (Washington, DC, May 27-29, 1978). For
                related documents see ED 190 997 and CS 006 443-449.
AVAILABLE FROM  Superintendent of Documents, Government Printing
                Office, Washington, DC 20402.
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *Literacy; Reading Achievement; *Reading Instruction;
                Reading Programs; *Reading Tests; *Standardized
                Tests; Test Construction; Testing Problems; *Test
                Validity

ABSTRACT
     The first of eight related documents, this booklet is part of a
series of papers presented at the 1978 National Right to Read
Conference examining issues and problems in literacy. In its
examination of standardized reading competency tests, the booklet
first offers a definition of the kind of test it will consider and
the criteria the test must satisfy to be deemed valid: content
validity, construct validity, concurrent validity, and predictive
validity. The paper reviews existing tests and offers approaches to
achieving validity according to each criterion. It concludes that the
education profession should not continue administering standardized
tests in their present form. (HTH)

***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*                    from the original document.                      *
***********************************************************************

U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION

EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.

LITERACY:

MEETING THE CHALLENGE

Assessment of Reading Competencies

Donald Fisher



The material in this booklet was prepared pursuant to a
contract with the Right to Read Program, U.S. Office
of Education, Department of Health, Education, and
Welfare. Contractors undertaking such work are
encouraged to express freely their professional judgments.
The content does not necessarily reflect Office of
Education policy or views.

The material in this booklet was presented at the National
Right to Read Conference, Washington, D.C.,
May 27-29, 1978. The material was edited by the staff
of the National Institute of Advanced Study which conducted
the Conference under contract from the U.S.
Office of Education.








FOREWORD 

A major goal of the Right to Read Program has been to disseminate
information about the status of literacy education, successful products,
practices and current research findings in order to improve the instruction of
reading. Over the years, a central vehicle for dissemination has been Right to
Read conferences and seminars. In June 1978, approximately 350 Right to
Read project directors and staff from State and local education and nonprofit
agencies convened in Washington, D.C. to consider Literacy: Meeting the
Challenge.

The conference focused on three major areas:

• examination of current literacy problems and issues;

• assessment of accomplishments and potential resolutions regarding
  literacy issues; and

• exchange and dissemination of ideas and materials on successful
  practices toward increasing literacy in the United States.

All levels of education, preschool through adult, were considered.

The response to the Conference was such that we have decided to publish the
papers in a series of individual publications. Additional titles in the series are
listed separately as well as directions for ordering copies.

                                        Shirley A. Jackson
                                        Director
                                        Basic Skills Program



LITERACY: MEETING THE CHALLENGE

A Series of Papers Presented at the
National Right to Read Conference
May 1978

Assessment of Reading Competencies
Donald Fisher

How Should Reading Fit Into a Pre-School Curriculum
Bernard Spodek

Relating Literacy Development to Career Development
Allen B. Moore

Private Sector Involvement in Literacy Efforts
    "The Corporate Model for Literacy Involvement"
    Lily Fleming
    "Reading Alternative: Private Tutoring Programs"
    Daniel Bassill
    "Building Intellectual Capital: The Role of Education in Industry"
    Linda Stoker

Who is Accountable for Pupil Illiteracy?
Paul Tractenberg

Publishers' Responsibilities in Meeting the Continuing Challenge of Literacy
Kenneth Komoski

Can Public Schools Meet the Literacy Needs of the Handicapped?
Jules C. Abrams

The Basic Skills Movement: Its Impact on Literacy
Thomas Sticht

Literacy, Competency and the Problem of Graduation Requirements
William G. Spady

Projections in Reading
    "Teaching Reading in the Early Elementary Years"
    Dorsey Hammond
    "Adult Literacy"
    Oliver Patterson
    "Reading Programs: Grades Seven Through Twelve"
    Harold Herber



SUMMARY

Overview

For the past fifteen years the validity of standardized tests, including those
purporting to measure reading achievement, has been frequently called into
question. Beginning with a definition of the kind of test it will consider and the
criteria the test must satisfy to be deemed valid, this paper reviews existing
tests in the light of each criterion in turn. It then offers approaches to
achieving validity, again according to each criterion. It concludes that the
profession should not continue administering standardized tests in their
present form.

Definitions

Asserting that validity depends upon situation, the author defines as the test
he will consider a general measure of children's functional literacy skills that
will be used both for purposes of accountability and to identify minimal
reading competencies. He then introduces the four criteria that test must
satisfy: content validity, construct validity, concurrent validity, and predictive
validity.

Threats to the Validity of Present Tests

No one type of instrument should bear the brunt of criticism because all
four in present use fail to satisfy the criteria that define validity. The manner in
which norm-referenced tests are constructed virtually precludes content
validity. Objectives-referenced tests also fall short of content validity, first
because no substantial evidence links the sub-skills measured with the skills
required for functional literacy, second because literacy skills themselves
have not been firmly established, and third because the domain from which
test items derive is undefined. The multiple-choice format common to
standardized tests makes construct validity virtually unattainable because it
does not allow students to give their reasons for choosing an answer and hence
provides no assurance that their errors result from deficiencies in the skills
that the questions intend to measure. The means for determining concurrent
validity remain incomplete; in default of a reliable cut-off score, there is no
assurance that any test identifies all and only the masters of the skills
measured. The author bypasses the problems of determining predictive
validity, first because studies of existing tests rarely discover significant
correlations between scores and performance in later life and second because
the tests studied were not constructed to predict what the test under
consideration should predict: adult success in life.




Approaches to Improving Validity 

Drawing upon approaches proposed and actually taken, the author offers
methods of constructing functional literacy measures that would meet the
objections set forth above. To assure that items are representative of the
appropriate domain, i.e., to achieve content validity, it would be possible to
adopt the approach taken by Army researchers, who ascertained what
materials and for what purposes soldiers read in connection with their work
and, from this information, defined job-related reading tasks. More
practicably, it would be possible to ask content specialists to rate the relevance
of items to the domain or to ascertain whether identically informed item
writers could construct equivalent tests. Acknowledging that steps to improve
construct validity, though many, are neither so simple nor so attractive, the
author confines himself to arguing that students taking multiple-choice tests
must have the chance to explain their answers. To arrive at a reliable cut-off
score, a prerequisite for establishing concurrent validity, minimum passing
levels for each item could be determined by experts and then totalled. The
author refers to statistical techniques that might correct for measurement
errors that result in misclassification. Finally, while acknowledging the
naivete of trying to correlate any set of items to success or failure in life, the
author presents two approaches to improving predictive validity, both
illustrated by the Adult Performance Level Project.
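
The cut-off procedure mentioned above (experts set a minimum passing level for each item, and the levels are totalled) can be sketched as follows. This is only an illustration under stated assumptions: the ratings are invented, each rating is read as the chance that a minimally competent student answers the item correctly, and averaging across experts before totalling is an assumption of this sketch, not something the paper specifies.

```python
from statistics import mean


def cutoff_score(expert_ratings):
    """Total the per-item minimum passing levels.

    expert_ratings holds one inner list per test item; each inner list
    holds one rating per expert (a value between 0 and 1). Averaging the
    experts' ratings for an item is an assumption made for this sketch.
    """
    return sum(mean(item_ratings) for item_ratings in expert_ratings)


def classify(raw_score, cut):
    """Label a student a master if the raw score reaches the cut-off."""
    return "master" if raw_score >= cut else "nonmaster"


# Hypothetical ratings: four items, three experts each.
ratings = [
    [0.9, 0.8, 1.0],
    [0.6, 0.5, 0.7],
    [0.4, 0.5, 0.3],
    [0.8, 0.8, 0.8],
]

cut = cutoff_score(ratings)
print(round(cut, 2), classify(3, cut), classify(2, cut))
```

A cut-off built this way is only as reliable as the expert judgments behind it, which is why the summary goes on to mention statistical corrections for misclassification.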



Conclusion

Present standardized tests risk misclassifying students and hence not only
fail to aid diagnosis and remediation, but perpetrate injustice. Since we cannot
claim ignorance of the problem, we have a duty to confront it.







ASSESSMENT OF READING COMPETENCIES 

Introduction

Ladies and gentlemen, it is indeed an honor and a pleasure to be here today.
I hope to make our next hour as enjoyable as it is instructive. To this end, I
have left out details which should perhaps have been included. Worse yet, I
may have included too many tired or worn out issues. If I do not succeed for
some of you, I trust you will understand it is not from a want of effort. As most
of you are aware, the focus of this morning's talk will be on measures of
reading achievement. In particular, we will ask whether the measures of
reading achievement really identify what we claim they measure. And if the
measures fail to identify what is claimed of them, then we will ask what can be
done to improve them. So much for my prefatory remarks.

We as educators, we as parents, we as students, we all are no longer
innocent. These are perhaps harsh words to begin a talk with. But I believe
they are justified. A little over fifteen years ago a book was published which
created some controversy. The book, aptly titled The Tyranny of Testing
(Hoffman, 1962), was the beginning of an end to our innocence. Traditional
standardized tests were vigorously criticized by the author of this book. The
producers of standardized tests were quick to retaliate (see for example,
"Explanation of Multiple-Choice Testing," 1961). The controversy raged for
awhile, but then seemed to fade. Perhaps many hoped it were no more than a
tempest in a teapot. However, matters heated up again in 1972. The National
Education Association passed a resolution calling for a moratorium on
standardized testing. The National Association for the Advancement of
Colored People issued a similar statement in the spring of 1975. The
Association for Supervision and Curriculum Development and American
Association of School Administrators, while not calling directly for a
moratorium, have used strong language about the need to reconsider uses of
standardized tests (Perrone, 1977).

The debate has been pursued at length in many of the most respected
journals in education. Phi Delta Kappan devoted this month's issue (May 1978)
to the use of standardized tests (see articles by Brickell, 1978; Cawelti, 1978;
Frener, 1978; Glass, 1978; Nathan and Jennings, 1978; Popham, 1978).
Today's Education devoted their March-April issue to the problems
surrounding standardized testing (see articles by Engel, 1977; McKenna,
1977; National Education Task Force on Testing, 1977; Taylor, 1977). The
journal National Elementary Principal devoted two issues in 1975 to
standardized testing. Yet, standardized tests are perhaps more in use today
than they ever were before.






The litany could continue for days, perhaps weeks. Fortunately, we can
stop here without doing damage to the point being made. Again, we are no
longer innocent. We cannot easily claim to be ignorant of the existence of the
controversy surrounding the use of standardized tests. We have been literally
besieged with arguments both for and against the use of such measures. But
knowing that such arguments exist is one thing. Knowing which of these
arguments to believe is quite another. I cannot hope to offer a complete
treatment of the arguments bandied about. I do not pretend to be unbiased.
But I can hope to present a few of the more salient issues as clearly and as
forcefully as possible, issues which bear repeating if already known, issues
which deserve a hearing if unfamiliar. So, we now turn to a discussion of the
major criticisms levelled against the use of standardized tests. In particular, we
will focus on those criticisms which bear in one way or another on the
purported validity of standardized tests of reading.

Accountability, Validity and Minimal Competencies

At least four criticisms have been raised against standardized tests of
reading achievement. The content validity of the tests, the construct validity,
the predictive validity and the concurrent validity have all received their share
of criticism. In order to expand on these criticisms we need a more detailed
description of the villain and a more detailed description of what we hope to
accomplish with our measure of reading achievement. The villains are
standardized tests, but certainly not all standardized tests or all aspects of any
one standardized test need be villainous. The goal of our testing is a valid
measure of reading competencies, but certainly some competencies are
important in one situation and not important in others. These points cannot
be made too strongly. A test is valid only in a certain context or situation. So, a
test valid in one situation may not be valid in another situation (Anastasi,
1965). The criteria for validity may themselves conceivably change from
situation to situation.

Thus, it is only proper to ask me at this point just what sort of situation I
have in mind. First, I am assuming that one is looking for a general measure of
children's functional literacy skills. Second, I am assuming that the measure is
to be used for purposes of accountability. And third, I am assuming that the
measure will be used to identify what have been called minimal reading
competencies. Interest in functional literacy, accountability and minimal
competencies has more or less gone hand in hand. The interest is centered at
all levels of government: national, State, and local (Fisher, 1978). As of March
15 of this year, 33 States had taken some action to mandate the setting of
minimum competency standards for elementary and secondary students
(Pipho, 1978). And many of these states are using measures of functional
literacy as standards of accountability. Typical functional literacy items
include those used on a test given by the Educational Testing Service (see
Figures 1 through 5). The stems for the items are, in order:




     1. Place a circle around the bottle of liquid that would be safe to
        drink.

     2. Look at the train schedules. Put a circle around the time the
        daily train leaving Trenton at 4:46 p.m. arrives in Washington.

     3. Put a circle around the label that would be the best one to put
        on a box used to mail something easily broken.

     4. Look at the garment tags. Circle the two tags that indicate the
        garments are made from 100 percent Polyester.

     5. Look at the application for employment. Put an X in the space
        where you would write the name and address of someone to
        notify in case of an emergency.

With the content and purpose of the measure in hand, we can go on to
delineate the criteria which such a measure must satisfy in order to be
considered valid. First, the criterion of representativeness or content validity
must be satisfied. The materials on the test should be representative of the
materials it is thought important for the student to read. And the questions
asked on the test should be representative of the sorts of questions it is thought
important for the student to answer. (See Bormuth, 1973, for a much more
complete exposition.) Second, the criterion of fairness or construct validity
should be satisfied. Children should not be penalized for a defensible answer,
even though this answer deviates from the one originally thought correct. The
existence of defensible answers is much more ubiquitous than one might at
first imagine if one can believe what one reads. There are other aspects to the
criterion of fairness. These aspects need not be mentioned until later. Third,
the criterion of present relevance or concurrent validity must be satisfied. In
the present case this means that the measure must be able to differentiate
between masters and nonmasters of functional literacy skills. And fourth, the
criterion of future relevance or predictive validity must be satisfied. In the
present case this means that later in life the masters must possess the minimal
competencies needed to weave their way through the warp and woof of daily
existence. Conversely, the nonmasters must not have the skills needed to
function at some minimum level in society. Type 1 measures will be used as a
shorthand to refer to instruments which meet these standards of validity.

The situation has been well-established. That is, the measure of reading 
achievement is to be used for the purpose of accountability. It remains to 
identify the villain. 



Four Criticisms of the Villain 



The villain is not any one sort of instrument. Criterion-referenced measures
of reading achievement, objectives-referenced measures of reading achievement,
all can be improved upon. If there is one message which you take home
with you at the end of this hour or so, it is just this: labels alone do not make a
measure valid. This message is simple enough, but it too often gets lost in the
rhetoric of the testing controversy. A test must be measured against the four
types of validity we set out. And how do present day measures stack up against
these criteria? An attempt to answer this question follows immediately. We
will look separately at content validity, construct validity, concurrent validity
and predictive validity.

[Figure 2. Book 1: Item 13. Train schedule; the timetable is not legible in the source.]




[Figure 3. Book 3: Item 1. Mailing labels, including CONFIDENTIAL; PERISHABLE - KEEP REFRIGERATED; AIR PARCEL POST.]



Content Validity

I spoke earlier of the criterion of representativeness or content validity. It
will be argued here that both norm-referenced measures and objectives-
referenced measures pose threats to the content validity of type 1 tests. First,
consider norm-referenced tests. We can, in fact, say something very strong
about the content validity of norm-referenced measures. We can say that, in
principle, such measures are designed to identify differences in
ability or achievement between students. Therefore, items will be excluded
from the instrument which fail to differentiate between students. In the
extreme case, an item may be answered correctly by everyone. This item
would inevitably be jettisoned (Popham, 1974). If everyone answers the item
correctly, then there is no reason to include it on a norm-referenced measure.
In general, it is clear that such a procedure will not lead to a representative
sample of behaviors or items. Items are systematically excluded and included
on the basis of their difficulty, not on the basis of their representativeness.

[Figure 4. Book 4: Item 9. Garment care tags; the tag text is only partly legible in the source.]


[Figure 5. Book 5: Item 5. Application for employment form; not legible in the source.]

Some members of the audience may accept the technical point, yet find it
rather unexciting, lacking any real oomph. Hopefully an example can give life
to the point. Figure 6 contains three sample items which might appear on a
pilot test of reading achievement in the 6th grade. Suppose one finds that over
99 percent of the students answer the first item correctly, roughly 50 percent of
the students answer the second item correctly, and fewer than one percent of
the students answer the third item correctly. In general, one would not expect
to find the first and third items appearing on the final version of the test. They
would be excluded because they fail to differentiate between students. Yet
they are clearly important items. By systematically excluding very easy and
very difficult items one may well be excluding those items, those reading
behaviors, which one would most like to see measured.
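
The screening step described above can be made concrete with a short sketch. It is an illustration only: the pilot data are invented to match the percentages in the example, and the 0.2 and 0.8 thresholds are hypothetical choices, not values taken from any actual test publisher.

```python
def proportion_correct(responses):
    """Share of students answering an item correctly (responses are 0 or 1)."""
    return sum(responses) / len(responses)


def screen_items(item_responses, low=0.2, high=0.8):
    """Split items into kept and dropped by difficulty alone.

    The low/high thresholds are hypothetical. Note that nothing here asks
    whether an item is representative of what students need to read.
    """
    kept, dropped = [], []
    for name, responses in item_responses.items():
        p = proportion_correct(responses)
        if low <= p <= high:
            kept.append((name, p))
        else:
            dropped.append((name, p))
    return kept, dropped


# Invented pilot results for 1000 students, mirroring the example above.
pilot = {
    "item_1": [1] * 995 + [0] * 5,    # over 99 percent correct
    "item_2": [1] * 500 + [0] * 500,  # roughly 50 percent correct
    "item_3": [1] * 5 + [0] * 995,    # fewer than one percent correct
}

kept, dropped = screen_items(pilot)
print("kept:", kept)
print("dropped:", dropped)
```

Only the middle item survives, even though the first and third items may cover exactly the behaviors one most wants measured.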

It has been argued that the content validity of type 1 measures is threatened
by the manner in which norm-referenced tests are constructed. It is also the
case that the content validity of type 1 measures is threatened by
objectives-referenced tests, which can be defined as those tests
which specify in great detail all the reading competencies or behaviors the
individual is thought to need. Some tests have specified upwards of 350 skills.
Global tasks are dissected, placed under the microscope, and claimed to
consist of so many minute behaviors. In principle, such an endeavor is
laudable enough. In practice, such efforts turn out to be rather dreary. This is
so for at least two reasons.


First, there is little, if any, evidence which links the specific sub-skills
measured on objectives-referenced tests with the reading skills one might need
as a functional literate. Indeed, this is not surprising. Functional literacy is
very much a part of literacy. And the specific skills needed to be literate, to
understand what one has read, have never been well-defined (Farr, 1969).
Thus, the set of skills selected for testing on objectives-referenced measures
may well not be representative of the larger set of skills needed for functional
literacy. This point of view receives further support when one goes so far as to
analyze the errors made on tests of comprehension. Particularly instructive is
the analysis of errors in the functional literacy measure administered by the
Educational Testing Service (Murphy, 1975). There were two phases to this
analysis. In the first phase, the answer booklets from a previous
administration of the functional literacy test were examined. Fully 85 percent
of the incorrect responses could not be categorized. If the relation between
reading competency sub-skills and general reading behaviors were
transparent, one would expect to find the classification task easy enough
under most circumstances. One could say the person made an error here
because they didn't have such and such a sub-skill, and so on. Now if one
can't relate putative sub-skills to actual reading tasks or demands, one has the
beginnings of a problem. For the possibility is raised that some of the
objectives are truly irrelevant to the demands written materials place upon the
reader. Thus, the set of skills selected for testing on objectives-referenced
measures may well not be representative of the larger set of skills needed for
functional literacy.

[Figure 6. Three sample items: (1) "Draw a circle around the bottle which you feel is unsafe to drink from."; (2) a verbal analogy, "Dime is to nickel as four is to: (a) ... (b) ... (c) 3 (d) 2"; (3) an item about a sign by the road. The remaining text is not legible in the source.]

The second phase of the analysis makes the same point still more forcefully.
Examinees were asked to elaborate on their answers as they went along. Other
than vocabulary and item format, the answers on the whole were not
amenable to any rigid sort of categorization or explanation. In fact the
responses to a particular item were often unique to one individual respondent.
Several examples of the rather unique way in which individuals respond are
offered by Murphy (1975). They bear repeating here:


     1. In a list in which the respondent is to choose an entry
        corresponding to baby's clothes, the entry "hampers" appeared.
        A respondent who chose that entry explained that he thought
        hampers might be like "Pampers," a commercial product of
        disposable baby's diapers.

     2. A list contained several amounts of alcohol and the effects
        associated with drinking such amounts. A respondent was
        asked to circle the amount associated with a given effect. He
        circled a greater amount and gave as his reason his disagreement
        with the chart. He judged that the effect would be associated
        with a greater amount of alcohol.

     3. A doctor's bill listed the amount owed. A respondent circled a
        higher amount listed on the bill because it corresponded more
        closely to her own latest doctor's bill.

Do these responses appear to be due to the absence of any of the sub-skills one
most often sees on tests of reading comprehension? I think the answer would
have to be no. Objectives-referenced tests at the moment simply do not possess
the content validity required of a type 1 measure. It should be mentioned that
the violation of content validity being considered here could also be construed
as a violation of construct validity.

The content validity of objectives-referenced measures is threatened for still
another reason. Ultimately, we want to be able to generalize from the type 1
measure to the domain under consideration. Consider a behavioral objective
such as, "The student shall be able to spell 80 percent of the list of 50 words
presented to him in a period of less than ten minutes." Such an objective says
nothing about the domain of words from which the test items are selected. As
such we cannot immediately generalize to some larger set of words. Our
knowledge about the student's performance is confined to the list of words on
the test. When the domain of behaviors is not clearly established, one has what
Popham (1974) calls a "cloud-referenced test."
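
By way of contrast, here is a minimal sketch of what a defined domain makes possible. Everything in it is hypothetical: the 500-word domain, the simulated student, and the function names are invented for illustration. The point is only that when the 50 test words are drawn at random from a stated domain, the score on the list can serve as an estimate of performance on the whole domain.

```python
import random


def sample_spelling_test(domain, n_items, seed=0):
    """Draw test words at random, without replacement, from a stated domain."""
    rng = random.Random(seed)
    return rng.sample(domain, n_items)


def estimate_domain_mastery(test_words, known_words):
    """Proportion of sampled words spelled correctly.

    Because the sample is random, this proportion estimates the share of
    the whole domain the student can spell.
    """
    correct = sum(1 for word in test_words if word in known_words)
    return correct / len(test_words)


# Hypothetical domain of 500 words; the simulated student can spell 400.
domain = [f"word{i:03d}" for i in range(500)]
known = set(domain[:400])

test = sample_spelling_test(domain, 50)
estimate = estimate_domain_mastery(test, known)
print(f"estimated share of the domain mastered: {estimate:.2f}")
```

Without the `domain` list, the same 50-word score would describe nothing beyond those 50 words, which is the situation the objective quoted above leaves us in.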

Construct Validity

Three related criticisms have been levelled against the construct validity of
various of the standardized tests. It will be remembered that the construct
validity of an instrument is threatened when an item purports to measure one
cognitive activity but actually measures another. The first of these related
criticisms is aimed specifically at the multiple-choice format of most
standardized tests. The critics find all manner of things wrong with the
multiple-choice format (see for example Hoffman, 1962). We will concentrate
on only one. In particular, I will focus on that criticism which faults the
multiple-choice format for not allowing a student to indicate his or her
reasons for choosing a particular item. As such, one never knows why a
student answered the item correctly, or for that matter incorrectly. In such a
case one could say that the measure is hierarchically opaque. If answers were
unambiguously right or wrong, there would be one less reason to argue
against the multiple-choice format. But answers are not always clear cut.

A particularly invidious sort of item is the verbal analogy. We have already
seen an item of this type (Figure 6, item #2). Note that one can easily provide
reasons which lead to the choice of any one of the alternatives. Choice (d) is
correct if one argues that the value of a nickel is one-half the value of a dime,
just as two is one-half of four. Choices (a), (b), and (c) are correct if one decides
that ten, the dime, is to five, the nickel, as even is to odd. Choice (c) is correct if
one decides that just as a nickel is the least smallest coin less than a dime, so
there is a least smallest integer less than four, namely three. Choice (b) is
correct if one decides a dime is 5 more units than a penny just as 4 is five more
units than 1. And choice (a) is correct if one decides that a dime, which has
four letters, is to nickel, which has six letters, just as four, which has four
letters, is to eleven, which has six letters.

Note that this is a problem intrinsic to the testing of verbal analogies with
the multiple-choice format. One can almost always (I hesitate to say always)
find reasons for choosing one alternative over another. And in general the
multiple-choice format is to be avoided. One simply cannot know whether
students are identifying a 'correct' answer for the wrong reasons, or
identifying an 'incorrect' answer for the right reasons. The threat to construct
validity is potentially enormous when one fails to identify the reasons a person
has chosen a particular alternative.

A second related criticism of construct validity again focuses on the
hierarchical opaqueness of such measures. Remember that the measures are
being called opaque because it is simply not possible to determine why a
student has chosen a particular alternative. Now suppose one wanted to know
whether an English-speaking person could broad jump four feet. One might
measure a four foot span, putting markers at the beginning and end of the
span. One would then go out and accost the first friendly neighbor one runs into.
But instead of asking him or her to jump the four feet in English, suppose one
asked him or her to jump the four feet in Chinese. Well, it is quite clear that the
person who speaks only English will not understand the request. It is equally
clear that the person may well be able to leap the full four feet. The moral of
the story is apparent enough. "If one wants to test a person's broad jumping
skills one doesn't at the same time test his language comprehension skills."

But what relevance does this have to measures of reading achievement, and
in particular to the issue of construct validity? Figure 7 may contain ordinary
enough phrases for the members of this audience. But for one group of
examinees the list of words was far from ordinary. The list of words was as
foreign to them as Chinese is to a person that speaks only English. All the
words appeared as part of the item stem on a test administered by the
Educational Testing Service (Murphy, 1975). The test was a measure of functional
literacy skills. It was determined in later analyses of other individuals that
these phrases were responsible for many errors. Unfortunately, these error
analyses were done after 10,000 individuals had already been tested. The test
was not supposed to measure an examinee's understanding of the stem. Yet
because the measure was hierarchically opaque it did indeed do so.

One no sooner unleashes this argument on one's opponents than one's
opponents pipe up with what appears to be a cogent counterargument. They
claim that the fault lies not with the multiple-choice format but with the failure
of the test makers. The counterargument calls for a proleptic defense of sorts.
While I grant the opponent's counterargument in principle, I see no reason to
admit defeat in fact. Remember the conclusions of the error analysis done by
the Educational Testing Service reported in the previous section. The 'errors,' so
to speak, were unique to each respondent. Many of the alternatives might have
been considered correct if the examinees had been allowed to voice the reason
for their choices. Since the reasons which lead to the particular choice of an
alternative are so unique, it seems most unlikely that one can design a valid









    to call up              ingredients             to operate
    transplant              liver                   classified
    apparel                 firearms                fuel
    stance                  commencement            locker
    lives                   extinguisher            minimum
    toll                    ingestion               severe
    injection               correspondent           mild
    series                  to fill in              whom
    creed                   misstatement of fact    come together
    permanent               experience              confronts
    pesticide               recipe                  circle
    fourth

Figure 7



multiple-choice test, a multiple-choice test which leaves no room for the
idiosyncratic answer.

This same criticism extends to other situations. We say we are measuring a
behavior such as determining the main idea of some prose passage. But can we
infer that individuals who did not get the main idea are deficient in "main
idea" skills? Probably not. We are more likely measuring something like
vocabulary. We say we are measuring a behavior which has something to do
with drawing inferences. But again, can we infer that the person who did not
circle the correct answer is deficient in inferencing skills? Again, probably not.
We are just as apt to be measuring general familiarity with the test material.
Some individuals might choose to argue at this point that the notions of
abstract skills such as "inferencing" or "getting the main idea" are empty
notions by themselves. For example, one does not draw inferences in a
vacuum. One draws inferences with respect to a particular content or written
passage. Therefore, it is legitimate to consider vocabulary as a component of
inferencing skills. I know of no hard and fast argument against such a
position. But I do know that such a position can only muddy the waters. For if
I want everyone to do poorly on some test of inference, I can simply make the
passage abstruse enough. In short, I can easily make it difficult if not
impossible to know whether students do indeed possess anything like
inferencing skills.

The third and final criticism is again focussed on the hierarchical
opaqueness of the standardized measures. However, this time their potential
for cultural bias is at issue. The existence of cultural bias in tests affects their
construct validity in the same way the existence of obscure vocabulary affects
the construct validity. Some studies have shown little change in the
performance of students when the test is rewritten in the dialect favored by the
students. The study I am going to report does find a change, a very large
change. The study (Thurmond, 1977) was an attempt to measure the effect of
black dialect on reading test performance of black and white high school
students. Forty-six low achieving ninth grade students were administered a
standard English form and black dialect form of the reading subtest of the
Stanford Diagnostic Reading Test. The dialect form was written so that the
written language of the test approximated the exact oral sentence pattern of
the black students taking the test. Results showed that black students
administered the dialect form did significantly better (.05) than black students
administered the standard English form. White students did significantly
better than black students on the standard English form of the test. The means
are reported in Figure 8. The results are especially striking when one realizes
that as little as a point can make a full grade difference in reading level.

Concurrent Validity

There are three common ways of measuring the concurrent validity of
norm-referenced tests. The test can be validated against some other already
established test. The measure can be compared with some other criterion, such
as judges' rating of performance. Or the test can be validated using what is
called the method of contrasted groups. For example, the test may be given to
one group that is thought to possess the skills in question, and to another
group that is thought not to possess the skills in question. These methods are
appropriate to type I measures, but incomplete. The problem comes in finding
a reasonable cut-off score, something that has not been attempted until
recently on any large scale. The finding of a reasonable cut-off score is a
problem for concurrent validity in the following sense. If the cut-off score is
set too high, it will fail to identify all the masters of a given domain. Therefore,
it will have less than perfect concurrent validity. Similarly, if the cut-off score
is set too low, it will identify some nonmasters as masters. But I am getting
ahead of myself. The new techniques for setting cut-off scores are best
discussed in a later section.

Predictive Validity

At the moment we can bypass the problems involved in the
determination of predictive validity. This can be done for two reasons. First,
consider extant measures where studies of predictive validity have been
undertaken. In the numerous studies where test scores have been used to
predict adult success one rarely if ever finds significant correlations (Nathan
and Jennings, 1978). For example, the dissenting opinion of California



                       Race
    Test Form      Black     White

    dialect         30.3      30.7
    standard        22.2      31.4

Figure 8

Supreme Court Justice Arthur Tobriner in the celebrated Bakke case cited
numerous studies showing no correlation between high medical school
admission test scores and quality performance as a physician later in life
("Regents of the University of California v. Allan Bakke").

    Indeed, the medical school's decision to deemphasize MCAT scores and grade-
    point averages for minorities is especially reasonable and invulnerable to
    constitutional challenge in light of numerous studies which reveal that, among
    qualified applicants, such academic credentials bear no significant correlation to
    an individual's eventual achievement in the medical profession. The findings of
    these studies are not surprising when one considers all of the nonacademic
    qualities -- energy, compassion, empathy, dedication, dexterity, and the like --
    which make for a "successful" physician.

One more example is worth citing. A recently published Phi Delta Kappan
article (Jennings and Nathan, 1977) cited research on the complete lack of
correlation between high scores on the two major college admissions tests and
success in adult life. The two tests considered were the American College Test
and the College Entrance Examination Board.

There is yet a second reason we can bypass a critique of the predictive
validity of previous measures. In general, such measures have not been
constructed with regard to the sort of predictive validity we have in mind.
Until recently no tests that I know of have been produced specifically for the
purpose of predicting adult success in life. So we cannot criticize earlier
measures simply because few are around to criticize.

Summary

To summarize, we have examined the various threats to the validity of type
I measures posed by traditional, standardized tests. It is not the name of the
test so much as the manner of construction and the item format which
identifies a measure as particularly offensive. The content validity and
construct validity of some present measures were found wanting. The need for
a new approach to concurrent and predictive validity was noted.
Appropriately enough, it is now time to consider just such new approaches.

Recommended Type I Measures 


Criticism is abundant. Constructive criticism is a bit more dear. And
creative, plausible alternative suggestions and solutions are hardest to come
by. Fortunately, this is one of those infrequent and happy occasions when
alternatives are available and cheap. Of course, the relative abundance of
alternatives to established ways does not preclude criticism. But at least it
leaves the road to constructive action paved with possibilities. As in the last
section, my remarks will all fall under the general rubric of validity. First, then,
we turn to a discussion of the ways in which one might go about improving the
validity of type I measures.




Content Validity

Norm-referenced and objectives-referenced tests were seen as threats to the
content validity of a type I measure. In one way or another, the norm-
referenced and objectives-referenced tests biased the content of these
measures. In general, the questions and materials on these tests are not
representative of the universe of written questions and materials. In the
abstract, the steps one must take to help guarantee representativeness are clear
enough (Ebel, 1962; Hively, Patterson, and Page, 1968). The steps are
summed up by Hambleton et al. (1978):

1. The domain must be specified clearly enough so that all items
which could be written from the content domain to be tested ...

2. ... must be written or known in advance of the final item selection
process.

3. A random or stratified random sampling procedure must be
used in the item selection process.
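
The third step can be made concrete with a short sketch. What follows is only an illustration of stratified random sampling from a fully enumerated item pool, not the procedure of Hambleton et al.; the function name, the three content areas, the item identifiers, and the fixed seed are all assumptions introduced for the example.

```python
import random
from collections import defaultdict


def stratified_item_sample(item_pool, items_per_stratum, seed=0):
    """Draw the same number of items at random from every stratum.

    item_pool: list of (item_id, stratum) pairs covering the whole domain.
    items_per_stratum: number of items to draw from each stratum.
    seed: fixed so that the draw can be reproduced (an assumption).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item_id, stratum in item_pool:
        by_stratum[stratum].append(item_id)

    sample = {}
    for stratum, ids in sorted(by_stratum.items()):
        if len(ids) < items_per_stratum:
            raise ValueError(f"stratum {stratum!r} has too few items")
        sample[stratum] = sorted(rng.sample(ids, items_per_stratum))
    return sample


# Hypothetical domain: 12 items spread evenly over three made-up areas.
POOL = [(f"item{i:02d}", area)
        for i, area in enumerate(["forms", "labels", "schedules"] * 4)]
SAMPLE = stratified_item_sample(POOL, items_per_stratum=2)
```

The sketch presupposes the first two steps: every item in the domain has already been written and assigned to a stratum. Where that enumeration is missing, as the text goes on to note is usual, the sampling step cannot be carried out as shown.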

While these goals remain models to strive for, they are rarely if ever
approached in practice. So we need ways of approximating these goals. Two
such approaches are discussed.

The first approach might be referred to as a hands-on attempt to retain
content validity. Something akin to the research done by Dr. Sticht, at
present an Associate Director of the National Institute of Education, seems
desirable (see Vineberg et al., 1971). A brief review of this work is in order. In
1966, the United States Army initiated Project 100,000. Up to 100,000
individuals were to be let into the armed services who would previously have
failed to gain entrance for reasons of health or measured intellect. The military
needed to know how much, if any, literacy training these men would require.
A sample of the Project 100,000 men suggested that much work lay ahead.
Approximately 30 percent of the sample read below the fourth-grade level
while almost 70 percent read below the sixth-grade level. By themselves these
figures mean little. The Army did not know the reading demands of the
various occupational specialties it could expect the Project 100,000 men to
enter. Nor did the Army know whether the scores on standardized tests of
reading achievement were good predictors of job performance. Therefore, the
Army sought to obtain information concerning the literacy demands of
military jobs and the predictive power of reading and other related tests.

In order to determine the actual reading demands of personnel in Army
jobs, research psychologists interviewed men at work. The men indicated both
what they read and why they read it. The most frequently cited materials and
tasks were eventually included on what came to be called job-related reading
tasks. The face validity of such an approach is unimpeachable.
Unfortunately, it exists in practice much less frequently than one might
suppose.




The second approach is more likely to be the one adopted by the majority of
individuals involved in the construction of functional literacy measures.
Instead of actually sampling the domain of behaviors, judges are asked to
indicate the relevance of an item to a particular domain. Any one of a number
of schemes has been suggested. Rovinelli and Hambleton (1977) suggest
three procedures, any one of which could be easily constructed. For example,
they suggest that content specialists rate the relevance of an item to the
domain being tested. One computes the mean of the ratings across content
specialists for each item. Presumably one can then agree upon some cut-off
score below which items are no longer considered appropriate to the measures
of a given domain. One can easily compute the variance associated with each
item. This gives a measure of the agreement among content specialists.
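
The rating procedure just described amounts to a mean and a variance per item. The sketch below is a minimal illustration under stated assumptions: a 1-to-5 relevance scale, three specialists, a cut-off of 3.5, and the item names are all invented for the example, and the population variance is used simply because the specialists are treated as the whole panel.

```python
from statistics import mean, pvariance


def screen_items(ratings, cutoff):
    """Summarize specialists' relevance ratings for each item.

    ratings: dict mapping an item name to a list of ratings, one per
             content specialist (hypothetical 1-5 scale).
    cutoff:  items whose mean rating falls below this value are dropped.
    """
    summary = {}
    for item, scores in ratings.items():
        item_mean = mean(scores)
        summary[item] = {
            "mean": item_mean,
            # Low variance means the specialists agree with one another.
            "variance": pvariance(scores),
            "keep": item_mean >= cutoff,
        }
    return summary


# Hypothetical ratings from three content specialists.
RATINGS = {
    "item_a": [5, 5, 4],   # relevant, judges agree
    "item_b": [2, 1, 2],   # not relevant, judges agree
    "item_c": [5, 1, 3],   # judges disagree sharply
}
SUMMARY = screen_items(RATINGS, cutoff=3.5)
```

An item like the third one is the interesting case: its mean alone would place it near the cut-off, while its large variance signals that the specialists do not share a view of the domain.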

Cronbach (1971) suggests what might be called a duplication experiment. A
group of item writers is selected and randomly divided in half. They receive the
same information about the domain to be tested. Each half then constructs a
test independently. If the domain specifications are clear, and the sampling
representative, then the tests should be equivalent.
Clearly, these are only stop gap measures. However, since the potential harm
of such methods seems minimal, and since the methods may indeed point up
weak spots, they may be worth pursuing. It should be noted in passing that the
beginnings of much more technically precise ways of sampling from a domain
have appeared in the literature recently (see for example Bormuth, 1973;
Hively, Patterson and Page, 1968). However, these methods are not yet
applicable in any area quite as diffuse as functional literacy.



Construct Validity

The steps needed to improve the construct validity of type I measures are
neither as simple nor initially as appealing as the steps required to improve the
content validity of such measures. Many approaches are possible. I will
present only one. It seems to me imperative that the student be given a chance
to explain his choice of a particular alternative while a multiple-choice test is
in progress. And furthermore, the student should be allowed partial or full
credit for explanations which bear up under scrutiny. On the surface such an
approach sounds at best unworkable, at worst indefensible. The approach
may seem unworkable because of the long hours its administration would
seem to entail. The approach may seem indefensible because of the door it
opens to the monster of subjectivity. I hesitate even at this moment to go
forward with the attack. But the end seems more than worth the ridicule that
may stand in the way. Note that I am not alone. Individuals who denounce the
present standardized tests almost to a person make the same criticisms of the
multiple-choice format that I have made. By implication, they must either go
on to argue against all testing whatsoever, or suggest some alternative
approaches. Unfortunately, no generally attractive alternatives have rolled off
the pens of today's critics. So I am left to breach the gap between the hoped for
and the possible.




I have chosen to argue for the multiple-choice format as a way of testing
functional literacy skills. However, I have added an important proviso. I have
suggested that the student be allowed to defend his answers as he proceeds
through the tests. It is now time for me to offer a brief defense of my own
choices. At least four arguments suggest themselves. First, the experience is an
instructive one for teachers and other individuals involved in the
administration of such an exam. Presumably the teacher is accountable for
behaviors on the test. It is to his or her advantage to become as familiar with
the important areas of functional literacy as possible. On the one hand, the
teacher involved in the type of testing I am suggesting is confronted with
deciding what it is that constitutes the general nature of correct and incorrect
answers. On the other hand, the teacher becomes more aware of the students'
strengths and weaknesses through listening to the students' responses.
Second, such a way of testing still retains a fair share of objectivity. The item
stem, the item itself, and the alternatives are identical for each and every
student. Third, the student as well as the teacher stands to gain from such a
procedure. The construction and defense of an argument in the space of a few
minutes is a skill to be valued in itself. But fourth and perhaps most important,
the increase in the construct validity of such a test over traditional tests seems
inevitable. So the procedure I have sketched seems to be of benefit to
everyone. In our rush to avoid subjectivity we may have lost sight of the
importance of construct validity. With more objectivity came a decrease in
construct validity. Perhaps it is high time that a more favorable balance was
struck.

Concurrent Validity 


The proliferation of techniques used to place individuals into the category
of master and nonmaster is bewildering at best, and counterproductive at
worst. All the techniques rest on the assumption that mastery and non-
mastery are meaningful concepts in the domain being tested. It is intuitively
plausible that such areas as mathematics and the sciences may satisfy this
assumption. However, the generally arbitrary nature of cut-off scores has
proved so troubling to some people that they seriously question the merits of
determining and using cut-off scores at all (Hambleton, 1978; Glass, 1978 (2)).
Nevertheless, we will assume for the moment that one can legitimately divide
the relevant portions of the world into masters and nonmasters. The approach
to cut-off score determination has until very recently been largely based on
some form of agreement between experts in the field (see Meskauskas, 1976,
for a good review of both this approach and the following approach).
Recently this has been augmented by helpful statistical techniques. I will
speak briefly of both approaches.

One of the first attempts to arrive at a cut-off score was undertaken in the
late 1940's for a University of Chicago departmental physics course (Nedelsky,
1954). The department, which taught physics courses by means of a common
subject outline, generated a joint departmental comprehensive examination
consisting of over 200 five-choice questions. Each of the approximately six
instructors who were teaching sections at a given point in time was asked to
look at the test prior to the candidates' taking it. The instructors had to decide
for each question which of the distractors the lowest passing student (in this
case a D student) should be able to identify as incorrect. The minimum
passing level, or MPL, for each item is the reciprocal of the number of
remaining alternatives. For example, if in a five-choice question only one of
the distractors is marked as one that the lowest passing student should be able
to eliminate, the minimum passing level for the item is 1/4 since there are four
remaining alternatives. Each question was rated by all the instructors in this
manner. For a five-choice item the possible values are .20, .25, .33, .50, and
1.0. The MPL for the examination consisted of a summation of these
individual item MPL values. The method becomes a bit more complicated in
practice, but the above reflects the general idea well enough. Figure 9 shows
how one might arrive at a cut-off score for an exam with five choices or
alternatives, five items and three judges.
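
The arithmetic of the method can be sketched in a few lines. This is an illustration of the general idea only, not Nedelsky's full procedure. The judges' counts below are made-up inputs, the function names are invented for the example, and averaging the judges' values for each item is an assumption (the text specifies only the summation over items; the averaging follows the "Item MPL" column of Figure 9).

```python
from fractions import Fraction


def judge_mpl(n_choices, n_eliminated):
    """Reciprocal of the alternatives left after one judge's eliminations."""
    remaining = n_choices - n_eliminated
    if remaining < 1:
        raise ValueError("at least the correct answer must remain")
    return Fraction(1, remaining)


def item_mpl(n_choices, eliminated_by_judge):
    """Average the judges' values for one item (assumed pooling rule)."""
    values = [judge_mpl(n_choices, e) for e in eliminated_by_judge]
    return sum(values, Fraction(0)) / len(values)


def exam_mpl(n_choices, judgments):
    """Sum the item MPLs to obtain the cut-off score for the exam."""
    return sum((item_mpl(n_choices, row) for row in judgments), Fraction(0))


# Hypothetical data: three five-choice items, three judges. Each number is
# how many distractors a judge thinks the lowest passing student can discard.
JUDGMENTS = [
    [2, 3, 2],
    [4, 3, 4],
    [1, 1, 0],
]
ITEM_MPLS = [item_mpl(5, row) for row in JUDGMENTS]
EXAM_MPL = exam_mpl(5, JUDGMENTS)
```

Exact fractions are used so that the item values (7/18, 5/6 and 7/30 here) add up without rounding error; rounded to two places they would read .39, .83 and .23.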

Several criticisms of such a method quickly come to mind (Meskauskas,
1976). One of these criticisms is a starting point for statistical procedures for
setting cut-off scores. Note that errors of measurement did not enter anywhere
into the discussion of the above method. However, in 1971 Emrick noted that
measurement errors would cause a number of non-masters to be included in
the master category. These were called alpha errors or false positives.
Similarly, a number of masters would be included in the non-master category.
These were called beta errors or false negatives. In many cases we might like to
know the relative abundance of false positive and false negative errors.
Furthermore, we might well want to change cut-off scores so that they gave
more weight to the false negative errors than they gave to the false positive
errors. That is, we might consider it very important to classify all masters as
such. Emrick's particular solution to this problem has since been disputed
(Wilcox and Harris, 1977). However, the importance of being able to
distinguish between alpha and beta errors is still with us and has worked its
way into a number of other models for determining cut-off scores. (See, for
example, Fhaner, 1974.)
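
To see how weighting the two kinds of error can move a cut-off score, consider the following sketch. It is not Emrick's model or any of the published models cited above; it simply counts alpha and beta errors at each candidate cut-off and picks the cut-off with the smallest weighted count. The scores, the "true" mastery labels (which in practice are never known directly), the weights, and the function names are all assumptions made for the illustration.

```python
def classification_errors(examinees, cutoff):
    """Count (false positives, false negatives) at a given cut-off.

    examinees: list of (observed_score, is_true_master) pairs.
    A score at or above the cut-off classifies the examinee as a master.
    """
    false_pos = sum(1 for score, master in examinees
                    if score >= cutoff and not master)   # alpha errors
    false_neg = sum(1 for score, master in examinees
                    if score < cutoff and master)        # beta errors
    return false_pos, false_neg


def best_cutoff(examinees, candidates, fp_weight=1.0, fn_weight=1.0):
    """Return the candidate cut-off with the lowest weighted error count."""
    def loss(cutoff):
        fp, fn = classification_errors(examinees, cutoff)
        return fp_weight * fp + fn_weight * fn
    return min(candidates, key=loss)


# Hypothetical examinees: (score, whether the person really is a master).
EXAMINEES = [
    (2, False), (3, False), (4, False), (5, True), (5, False),
    (6, False), (7, True), (8, True), (9, True),
]
EQUAL_WEIGHT_CUTOFF = best_cutoff(EXAMINEES, [4, 5, 6, 7])
FN_HEAVY_CUTOFF = best_cutoff(EXAMINEES, [4, 5, 6, 7], fn_weight=3.0)
```

With equal weights the sketch settles on the higher cut-off, at the cost of failing one true master. Once a false negative is counted three times as heavily as a false positive, the lower cut-off wins, which is the kind of shift the text has in mind when it speaks of classifying all masters as such.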



Predictive Validity 




Finally, we come to a consideration of the efforts taken to increase the
predictive validity of measures of functional literacy, to increase the extent to
which success on the measure predicts success in adult life. There are in
general two ways of going about the task. Both ways can be illustrated by the
work done at the University of Texas in what has come to be called the Adult
Performance Level Project (Adult Performance Level Project Staff, 1973).
First, the predictive validity of a measure may be increased in a relatively
simple and straightforward way. If experts can agree on what the minimally
literate adult must be able to read, then these experts' opinions can be put to
good use. Items can be placed on the measure which reflect the experts'
opinion of what must be tested. A reasonably conscientious and careful
project can go a long way toward clarifying the content of what must be tested
as well as increasing the potential predictive validity of the measure. Before
confusion sets in, let me distinguish between the concern with content validity
in an earlier section and the concern with predictive validity in this section. In
the section on content validity I assumed that the general domain of
importance had already been specified. The job was to fill in the domain with
the appropriate content. Here I am not assuming that the general content area
has been specified. Thus we are backing up a step.

The manner in which the Adult Performance Level people set out the areas
of importance is worth noting (Figure 10). They divided the world into general
knowledge areas and basic skills. There were six general knowledge areas:
occupational knowledge, consumer economics, health, community resources,
government and law, and transportation. And there were six basic skills:
reading, writing, listening, computation, problem solving, and interpersonal
relations. My point in bringing up this example is not to suggest that there is
anything particularly good about their division of knowledge areas and basic
skills. My point is more general. A matrix such as the one which appears in
Figure 10 allows one to be complete, to forge ahead with some map of the
universe. I think such a map is a welcome adjunct to our intuitive notions of
what are and what are not minimal competencies.

There is a second way one might go about increasing the predictive validity
of measures of functional literacy. One might worry less about what the



            Judge 1       Judge 2       Judge 3
    Item    A     B       A     B       A     B       Item MPL

     1      2    1/3      3    1/2      2    1/3        .39
     2      4     1       3    1/2      4     1         .83
     3      1    1/4      1    1/4      0    1/5        .23
     4      2    1/3      1    1/4      3    1/2        .36
     5     ...   ...     ...   ...     ...   ...        .25

    Test Minimum Passing Level: (.39 + .83 + .23 + .36 + .25) = 2.06

    A: number of choices minimally competent student should be able to discard

    B: reciprocal of number of remaining items, i.e., expected value for an
       item with equiprobable choice of remaining items

Figure 9



minimally competent person ought to read and more about what the
minimally competent person should have achieved in his job and other life
activities. The only thing that justifies the previous procedure is the notion
that the materials we select are indeed needed to be competent. Instead of this
more or less subjective approach, one might identify various indices of
competence. That is, a person who is placed high on an index of competence
should score well on an item designed to measure competence skills. Again,
this is just what the Adult Performance Level people did. They identified three
indices of competence: occupational prestige, education and weekly income.
Four levels to each index were defined. Scores on an item were then correlated
with level of an index (Figures 11 through 13). Items on which scores
correlated well with the level of an index were kept in; other items were thrown
out.
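
The item-screening step can be sketched as follows. The Adult Performance Level report is not quoted here on which correlation coefficient or threshold was used, so the Pearson coefficient, the threshold of .5, the eight respondents, and the item names are all assumptions introduced purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no usable correlation
    return sxy / sqrt(sxx * syy)


def select_items(item_scores, index_levels, min_r):
    """Keep items whose scores correlate at least min_r with the index."""
    return [item for item, scores in item_scores.items()
            if pearson_r(scores, index_levels) >= min_r]


# Hypothetical data: eight respondents, each placed at one of four levels
# of some index of competence (for instance weekly income).
INDEX_LEVELS = [1, 1, 2, 2, 3, 3, 4, 4]
ITEM_SCORES = {
    "item_a": [0, 0, 0, 1, 1, 1, 1, 1],   # passes rise with index level
    "item_b": [1, 0, 1, 0, 1, 0, 1, 0],   # passes unrelated to index level
}
KEPT = select_items(ITEM_SCORES, INDEX_LEVELS, min_r=0.5)
```

In this made-up case only the first item survives. Note that the procedure selects items by their statistical association with the index and says nothing about why the association holds.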

Lest you think everything is turning up roses, the following comment by a
well-respected educator is in order (Glass, 1978 (2)):

    To my knowledge, every attempt to derive a criterion score is
    either blatantly arbitrary or derives from a set of arbitrary
    premises. But arbitrariness is no bogey man, and one ought not to
    shrink from a necessary task because it involves arbitrary
    decisions. However, arbitrary decisions often entail substantial
    risks of disruption and dislocation. Less arbitrariness is safe.

    Teachers and their consultants attempting to define "competence"
    and writing test items intended to reflect minimal levels of
    acquisition are engaged in a bootless and potentially embarrassing
    endeavor. They are likely to construct a competency-based test for
    graduation that is quite inappropriately difficult. Then they will be
    forced to back off and will be accused publicly of either not
    knowing what students ought to know or else not teaching
    students what they ought to learn. They are in fact guilty on
    neither account. No one knows how well a person must read to
    succeed in life, or what percent of the graduating class ought to be
    able to calculate compound interest payments.

I must confess that I agree with the spirit of Dr. Glass's remarks. It does indeed
seem to me rather naive to assume that we can actually find a set of items
which more or less guarantees success or failure in life. Perhaps the whole
notion of minimal competencies is as silly and as useless as the vote taken
during one's senior year in high school on the student most likely to succeed.
But these remarks fall generally outside the substance of this talk. For our
purposes we need only note that one can seemingly take measures that
improve the predictive validity of our instruments.

Conclusion

It should be clear by now that many standardized tests are simply very poor
measures of functional literacy. One can all too easily find fault with the

                             Reading  Writing  Listening  Computa-  Problem  Interper-
                                                          tion      Solving  sonal
                                                                             Relations
    Occupational Knowledge
    Consumer Economics
    Health
    Community Resources
    Government and Law
    Transportation

Figure 10



validity of most standardized measures. And valid is just what we want our 
measures to be. 

Implicit in my criticism of standardized tests have been the following two
equally important, if not more important, criticisms. First, it is simply bad
economics to go on testing as we do. Too many children may be misclassified.
Teachers will learn little if anything from the testing situation. Diagnosis and
remediation may well be unrelated to the real problems of students. But bad
economics as a criticism pales before the inherent unfairness of standardized
tests. Students are sentenced to a test score without a trial. Students are not
allowed to defend their answers. If an answer is given which test developers
did not consider, so much the worse for the student. There is no reason for
tests to be the arbitrary, often capricious dictators of students' lives that they
are. Something can be done. Something should be done, for we are no longer
innocent.






REFERENCES 



Adult Performance Level Project Staff. The Adult Performance Level Study. Austin, Texas:
Division of Extension (University of Texas), January 1973.


Anastasi, Anne. Psychological Testing. New York: MacMillan, 1965.

Bormuth, John R. Reading literacy: its definition and assessment. Reading Research Quarterly,
V IX, N 1, 1973-1974, pp. 6-7.

Brickell, Henry M. Seven key notes on minimum competency testing. Phi Delta Kappan, V 59,
N 9, May 1978, pp. 589-592.






Cawelti, Gordon. National competency testing: a bogus solution. Phi Delta Kappan, V 59, N 9,
May 1978, pp. 619-620.

Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.), Educational Measurement (2nd
edition). Washington, D.C.: American Council on Education, 1971.

Ebel, R.L. Content standard test scores. Educational and Psychological Measurement, V 22,
1962, pp. 11-17.

Ebel, Robert L. The case for minimum competency testing. Phi Delta Kappan, V 59, N 8, April
1978, pp. 546-548.

Emrick, J.A. An evaluation model for mastery testing. Journal of Educational Measurement, V 8,
1971, pp. 321-326.

Engel, Brenda. One way it can be. Today's Education, V 66, N 2, March-April 1977, pp. 50-52.

Explanation of multiple-choice testing. Princeton, N.J.: Educational Testing Service, 1961.

Item 9

If you wanted to apply for the job shown below, which of the
following application methods would you use?

    Security Officers start $2 per
    hour, uniforms furnished. Apply
    801 W. 2^th Str, after 6 p.m.
    Holmes Lobby Desk.

a. telephone call
b. written application (resume)
Xc. in-person application
d. I don't know

Eighty-two percent of the sample answered this item correctly
while 13 percent answered incorrectly. Percent correct responses
according to criterion variables are given in Figure 9.



[Bar chart: percent correct responses (axis marked from 45 to 99) by Level A-D
for each criterion variable: Occupational Prestige, Weekly Income, ABE Level,
Educational Level.]



Figure 9. Occupational Knowledge referenced item on work:
application procedure

Figure 11






Farr, Roger. Reading: what can be measured? (IRA Research Fund Monograph). Newark,
Delaware: International Reading Association, 1969.

Fremer, John. In response to Gene Glass. Phi Delta Kappan, V 59, N 9, May 1978, pp. 605-606.

Glass, Gene V. Minimum competence and incompetence in Florida. Phi Delta Kappan, V 59, N 9,
May 1978, pp. 602-604.

Glass, Gene V. Matthew Arnold and minimal competence. Educational Forum, January 1978,
pp. 139-144.

Hambleton, Ronald K., H. Swaminathan, J. Algina, D.B. Coulson. Criterion-referenced testing
and measurement: a review of technical issues and developments. Review of Educational
Research, V 48, N 1, Winter 1978, pp. 1-48.

Hively, W., H.L. Patterson, S.A. Page. A "universe-defined" system of arithmetic achievement
tests. Journal of Educational Measurement, V 5, 1968, pp. 275-290.



Item 4

Mr. Packard wants to buy a car. The salesman says that he can
pay for it over a year and that, plus interest, the price will
be $255.66. Interest is:

a. the salesman's salary
b. the actual value of the car
Xc. the cost charged for handling the deal on the time basis
d. a state tax
e. I don't know

The percent correct response attained by the APL sample on this
item was 70 percent. This item, like the preceding ones, also dealt
with a commercial term (interest). However, the item differentiated
among the levels on all four group variables. The more successful
persons were more likely to know the meaning of interest. Percent
correct responses for the criterion variables are presented in
Figure 4.

[Bar chart: percent correct response by level for each criterion variable:
Occupational Prestige, Weekly Income, ABE Level, Education Level.]



Figure 4. Consumer economics referenced item on commercial term:

Figure 12 



Hoffman, Banesh. The tyranny of testing. New York: Crowell-Collier, 1962.

Jennings, W. and J. Nathan. Startling, disturbing research on school program effectiveness. Phi
Delta Kappan, March 1976, pp. 568-572.

McKenna, Bernard. What's wrong with standardized testing? Today's Education, March-April
1977, pp. 36

Meskauskas, John. Evaluation models for criterion-referenced testing: views regarding mastery
and standard setting. Review of Educational Research, V 46, N 1, Winter 1976, pp.
133-158.

Murphy, Richard T. Adult Functional Reading Study (PR 75-2). Princeton, N.J.: Educational
Testing Service, 1975.

Nathan, Joe and Wayne Jennings. Educational bait-and-switch. Phi Delta Kappan, V 59, N 9,
May 1978, pp. 621-625.



Items 31-32

This two-item exercise was developed to test the adults' ability
to calculate weight and price per unit in order to arrive at the most
economical buy on food purchases.

Directions: Below you will see three boxes of cereal. On
each box is printed the price and the weight. Look at the
prices and the weights and then answer the two questions
below, please.

    El Grosso Cereal        17 oz. Net Weight    89¢
    Soggy Woggies Cereal    19 oz. Net Wt.       91¢
    Frasbee Oat Cereal      18 oz. Net Weight    93¢



Item 31

Which of the three boxes of cereal is the best buy?
Answer: Soggy Woggies Cereal

Only 52 percent of the sample answered this item correctly. Chi
square values of pattern of response reached significance on three of the
criterion variables; Weekly Income was the exception. Adults in
Level C of Occupational Prestige ratings did slightly better than adults
in the higher Level D rating. The overall trend of response was in an
ascending pattern, indicating that adults in the higher levels were more
successful in figuring which cereal was the best buy. Figure 12 gives
percent correct responses by levels across the criterion variables.
(See Figure 12).

Item 32

Which of the three boxes of cereal contains the most cereal by
weight?

Answer: Soggy Woggies Cereal



Figure 13 
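
The unit-price comparison that Items 31 and 32 ask for can be sketched in a few lines of Python. This is only an illustration, not part of the APL materials. The pairing of each price with its box is an assumption read from the degraded scan, and the function names are hypothetical.

```python
# Price and weight for each box as read from the scanned item.
# ASSUMPTION: the pairing of prices with boxes is inferred from the
# order in which they appear in the scan; the original figure is degraded.
boxes = {
    "El Grosso Cereal": (89, 17),      # (price in cents, net weight in oz)
    "Soggy Woggies Cereal": (91, 19),
    "Frasbee Oat Cereal": (93, 18),
}


def unit_price(price_cents, weight_oz):
    """Return the price per ounce, in cents."""
    return price_cents / weight_oz


def best_buy(items):
    """Return the name of the box with the lowest price per ounce (Item 31)."""
    return min(items, key=lambda name: unit_price(*items[name]))


def heaviest(items):
    """Return the name of the box with the greatest net weight (Item 32)."""
    return max(items, key=lambda name: items[name][1])


for name, (price, weight) in boxes.items():
    print(f"{name}: {unit_price(price, weight):.2f} cents per oz")
print("Best buy (Item 31):", best_buy(boxes))
print("Most cereal (Item 32):", heaviest(boxes))
```

Under the assumed pairing, both functions return Soggy Woggies Cereal, which agrees with the answers printed for the two items.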



National Education Association Task Force on Testing. A summary of alternatives. Today's
Education, V 66, N 2, March-April 1977, pp. 54-55.

Nedelsky, L. Absolute grading standards for objective tests. Educational and Psychological
Measurement, V 14, 1954, pp. 3-19.

Perrone, Vito. On standardized testing and evaluation. In Paul L. Houts (Ed.), The myth of
measurability. New York: Hart Publishing Company, 1977.

Pipho, Chris. Minimum competency testing in 1978: a look at state standards. Phi Delta
Kappan, V 59, N 9, May 1978, pp. 585-588.

Popham, W. James. An approaching peril: cloud-referenced tests. Phi Delta Kappan, V 55, N 6,
May 1974, pp. 611-614.

Regents of the University of California, Davis v. Allan Bakke. Pacific Reporter 2nd, V 553, p.
1151.

Vineberg, R., T.G. Sticht, E.N. Taylor, J.S. Caylor. Effects of aptitude, job experience and
literacy on job performance: summary of HumRRO work units UTILITY and REALISTIC
(TR 71-1). Alexandria, Va.: Human Resources Research Organization, February 1971.

Taylor, Edwin F. The looking-glass world of testing. Today's Education, March-April 1977,
pp. 39-44.

Wilcox, R. and C.W. Harris. On Emrick's "An evaluation model for mastery testing." Journal
of Educational Measurement, V 14, N 3, Fall 1977, pp. 215-218.







U.S. GOVERNMENT PRINTING OFFICE: 1980 O-302-477






